Quant Bootcamp

Sahil Deo and Jayati Sharma

Effective Data Visualization - Why

  • Data visualization is not a tool; it is a means of communicating results
  • The communicator needs to bring the story visually and contextually to life
  • It takes skill to make graphs that convey the message effectively
  • You want to make the entire audience take away the same thing from your visualization
  • To understand what effective visualizations are, let us take a look at some examples which are not effective

Say no to 3D graphs!

  • 3D graphs make visualizations needlessly complex
  • The bars at the front hide the ones at the back.
  • Other than looking flashy, they almost never serve the purpose of conveying information
  • The tilt makes it even more difficult to read the values

The more, the merrier?

  • This graphic tries to show MLA salaries
  • The information overload makes it look aesthetically unpleasant
  • With all the information, it looks difficult to understand
  • Provided with such a visual, the audience would not be able to understand anything of substance

Source: jotform

Misleading Visualization

  • Massive growth in iPhone sales, isn’t it?
  • However, a closer look shows that the sales are cumulative over time
  • Cumulative sales don’t necessarily show growth
  • Moreover, the graph does not have a scale on the y-axis
  • It isn’t clear what the graph is trying to show

Source: syntaxtechs
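
The point that cumulative sales do not necessarily show growth can be checked with a few lines of R. This is a minimal sketch with made-up numbers (the vector `quarterly` is hypothetical and is not real iPhone data): sales fall every quarter, yet the running total keeps rising.

```r
# Hypothetical quarterly unit sales (illustrative numbers, not real iPhone data)
quarterly <- c(40, 35, 30, 25, 20)

# Running total of the same series
cumulative <- cumsum(quarterly)

# Quarter-on-quarter change: negative for sales, positive for the running total
diff(quarterly)
diff(cumulative)
```

A cumulative line can only go down if a period has negative sales, so an upward slope on its own says nothing about growth.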

Avoid Pie Charts

  • Pie charts, in most cases, are not the best approach to visualize data.
  • Not only do the slices fail to add up to 100% in this case, they also make it difficult to comprehend which component has the biggest share
  • When you choose a donut chart, you are essentially asking your audience to measure the arc length to see which has the biggest share
  • Redundant, no? A simple bar chart would have done the job

Source: jotform

Always a zero baseline

  • There seems to be a major difference between Democrats and Republicans in the percentage of people who agreed with the court
  • However, a closer look shows that the visualization uses a baseline of 50%, which means that the difference is only around 8 percentage points
  • Graphs must always be made with a zero baseline so that differences are not exaggerated

Source: syntaxtechs
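
The effect of a truncated baseline can be worked out with simple arithmetic. The sketch below uses illustrative values only (the data frame `agree` and its numbers are assumptions, not the figures from the chart): two bars that differ by 8 points look three times apart when the axis starts at 50, but only about 1.15 times apart when it starts at 0.

```r
library(ggplot2)

# Illustrative values only (not the figures from the chart discussed above)
agree <- data.frame(
  party   = c("Group A", "Group B"),
  percent = c(62, 54)
)

# Apparent ratio of bar heights when the axis starts at 50 versus at 0
ratio_truncated <- (62 - 50) / (54 - 50)  # visible heights are 12 and 4
ratio_zero      <- 62 / 54                # visible heights are 62 and 54

# geom_col() starts bars at zero by default
p_zero <- ggplot(data = agree) +
  geom_col(aes(x = party, y = percent))

# coord_cartesian() zooms the axis without dropping data,
# which reproduces the truncated look for comparison
p_truncated <- p_zero + coord_cartesian(ylim = c(50, 65))
```

Printing `p_zero` and `p_truncated` one after the other shows the same data telling two very different stories.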

Data Visualization in R

What?

  • Converting large amounts of data into easily understood information
  • This graphical representation of data is data visualization

Why?

  • While exploring your data, you would want to see associations between variables to understand patterns in data
  • For effective presentation of data

The ggplot2 package

Content for this topic has been sourced from Winston Chang’s ‘R Graphics Cookbook, 2nd edition’. Please check out the work for detailed information.

  • ggplot2 takes a different approach to graphics than other plotting packages in R
  • “Grammar of graphics” - provides a formal, structured perspective on how to describe data graphics

Loading the packages

  • tidyverse is a collection of R packages, including ggplot2
  • If you load tidyverse, there is no need to load ggplot2 separately
library(tidyverse)
library(ISLR2)
  • Load the diamonds dataset from ggplot2
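
One possible way to load and inspect the dataset is sketched below. It loads only ggplot2 so that the snippet stands on its own; loading tidyverse as shown earlier would work as well.

```r
library(ggplot2)

# diamonds ships with ggplot2, so it is available once the package is loaded
data(diamonds)

dim(diamonds)      # number of rows and columns
head(diamonds)     # first few rows
str(diamonds$cut)  # cut is stored as an ordered factor
```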

ggplot2 Terminology

Content for this topic has been sourced from Winston Chang’s ‘R Graphics Cookbook, 2nd edition’. Please check out his work for detailed information.

Some of the terminologies used in ggplot2:

  • Data - what we want to visualize; it consists of variables
  • Geoms - geometric objects that are drawn to represent the data, such as bars, lines, and points
  • Aesthetics - visual properties of geoms, such as x and y position, line color, point shapes, etc.
  • Mappings - the links from data values to aesthetics
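
The difference between mapping an aesthetic to data and setting it to a fixed value can be sketched as follows. The plot names `p_mapped` and `p_fixed` and the choice of `carat` and `price` are only for illustration.

```r
library(ggplot2)

# Mapping: colour depends on a variable, so it goes inside aes()
p_mapped <- ggplot(data = diamonds) +
  geom_point(aes(x = carat, y = price, colour = cut))

# Setting: one fixed colour for every point, so it goes outside aes()
p_fixed <- ggplot(data = diamonds) +
  geom_point(aes(x = carat, y = price), colour = "steelblue")
```

In `p_mapped`, ggplot2 picks one colour per level of `cut` and adds a legend; in `p_fixed`, every point gets the same colour and no legend is drawn.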

Building a plot

Task - You want to plot a boxplot of cut and depth

ggplot(data = diamonds) #your plot area

Building a plot

Task - You want to plot a boxplot of cut and depth

ggplot(data = diamonds) + # your plot area
  geom_boxplot(aes(cut, depth),
               fill = "lightyellow") #your geom and aesthetic

Building a plot

Task - You want to plot a boxplot of cut and depth

ggplot(data = diamonds) + # your plot area
  geom_boxplot(aes(cut, depth),
               fill = "lightyellow") + #your geom and aesthetic
  theme_minimal() #adding the theme

Building a plot

Task - You want to plot a boxplot of cut and depth

ggplot(data = diamonds) + # your plot area
  geom_boxplot(aes(cut, depth),
               fill = "lightyellow") + #your geom and aesthetic
  theme_minimal() + #adding the theme
  labs(x = "Cut of Diamonds",
    y = "Depth",
    title = "Total Depth Percentage by Cut"
    )# labels

A Graphing Template

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science’. Please check out his work for detailed information.

  • A basic template for plotting a graph through ggplot2 can be
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
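
As a sketch of how the template might be filled in, the example below uses `diamonds` for `<DATA>`, `geom_point` for `<GEOM_FUNCTION>`, and `x = carat, y = price` for `<MAPPINGS>` (these choices and the names `p_a` and `p_b` are only for illustration). It also shows an equivalent form in which the mapping is given in `ggplot()` and inherited by the geom.

```r
library(ggplot2)

# Template filled in: the mapping is supplied to the geom
p_a <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

# Equivalent: a mapping given in ggplot() is inherited by every geom added to it
p_b <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point()
```

The second form is convenient when several geoms share the same x and y variables.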

Plotting a….

Continuous Variable

Plotting histogram of carat

ggplot(data = diamonds) +
  geom_histogram(aes(x = carat))
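
By default, geom_histogram() uses 30 bins and prints a message suggesting that a better value be chosen. A minimal sketch of setting the bin width explicitly is shown below; the value 0.1 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Each bar now covers 0.1 carat
p_hist <- ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.1)

p_hist
```

Trying a few bin widths is usually worthwhile, since the shape of a histogram can change noticeably with the choice.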

Discrete Variable

Plotting a bar chart for cut

ggplot(data = diamonds) +
  geom_bar(aes(x = cut))
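
geom_bar() counts the rows in each category and uses the counts as bar heights. One way to confirm this, sketched here, is to compare the values ggplot2 computes with a direct tabulation (the name `p_bar` is only for illustration).

```r
library(ggplot2)

p_bar <- ggplot(data = diamonds) +
  geom_bar(aes(x = cut))

# Bar heights computed by geom_bar()
layer_data(p_bar)$count

# The same counts tabulated directly from the data
table(diamonds$cut)
```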

Scatterplots - Plotting two continuous variables

  • Load the Auto dataset from ISLR2 package

Task - You want to plot a scatterplot of mpg and displacement

ggplot(data = Auto) +
  geom_point(aes(x = mpg, y = displacement)) 

Aesthetics

Auto$origin <- as.factor(Auto$origin)
  
ggplot(data = Auto) +
  geom_point(aes(x = mpg, y = displacement,
                 shape = origin)) #shows different origin by shapes

ggplot(data = Auto) +
  geom_point(aes(x = mpg, y = displacement,
                 colour = origin)) #shows different origins by colours

ggplot(data = Auto) +
  geom_point(aes(x = mpg, y = displacement,
                 shape = origin),
             alpha = 0.5) #for setting the opacity
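
The legend for origin shows the codes 1, 2 and 3, which are hard to read. A possible refinement, sketched below, is to attach labels when creating the factor. The labels follow the coding given in the ISLR2 help page for Auto (1 = American, 2 = European, 3 = Japanese); the copy `auto_labelled` is a name introduced here for illustration.

```r
library(ggplot2)
library(ISLR2)

# Work on a copy so the original Auto data is left untouched
auto_labelled <- Auto
auto_labelled$origin <- factor(
  auto_labelled$origin,
  levels = c(1, 2, 3),
  labels = c("American", "European", "Japanese")  # coding per the ISLR2 help page
)

# Colour by the labelled factor and keep the points semi-transparent
p_origin <- ggplot(data = auto_labelled) +
  geom_point(aes(x = mpg, y = displacement, colour = origin),
             alpha = 0.5)

p_origin
```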

Bar Charts - Plotting a discrete and a continuous variable

Task - You want to plot the mean carat by cut from the diamonds dataset

diamonds |>
  group_by(cut) |>
  summarise(mean_carat = mean(carat)) |>
  ggplot(aes(x = cut, y = mean_carat))+
  geom_col()
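
As an alternative sketch, ggplot2 can compute the group means itself, so the summarising step is not needed. This swaps in stat_summary() for the group_by() and summarise() pipeline; it assumes ggplot2 3.3.0 or later, where the `fun` argument is available.

```r
library(ggplot2)

# stat_summary() computes mean(carat) within each cut and draws it as a column
p_summary <- ggplot(data = diamonds, aes(x = cut, y = carat)) +
  stat_summary(fun = mean, geom = "col")

p_summary
```

Summarising first, as in the pipeline version, has the advantage that the computed means can be inspected as a table before plotting.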

Formatting your Plot

p1 <- diamonds |>
  group_by(cut) |>
  summarise(mean_carat = mean(carat)) |>
  ggplot(aes(x = cut, y = mean_carat))+
  geom_col(width = 0.70, fill = "#9ac5db")

p1

p2 <- diamonds |>
  group_by(cut) |>
  summarise(mean_carat = mean(carat)) |>
  ggplot(aes(x = reorder(cut, mean_carat), y = mean_carat))+
  geom_col(width = 0.70, fill = "#9ac5db")

p2

p3 <- diamonds |>
  group_by(cut) |>
  summarise(mean_carat = mean(carat)) |>
  ggplot(aes(x = reorder(cut, mean_carat), y = mean_carat))+
  geom_col(width = 0.70, fill = "#9ac5db")+
  labs(title = "'Fair' Cut Has the Highest Mean Carat Value",
       subtitle = "Mean Carat Value by Quality of the Cut",
       y = "Mean Carat",
       x = "Quality of the Cut",
       caption = "Data Source: diamonds | Analysis by Participant")

p3

p4 <- diamonds |>
  group_by(cut) |>
  summarise(mean_carat = mean(carat)) |>
  ggplot(aes(x = reorder(cut, mean_carat), y = mean_carat))+
  geom_col(width = 0.70, fill = "#9ac5db")+
  labs(title = "'Fair' Cut Has the Highest Mean Carat Value",
       subtitle = "Mean Carat Value by Quality of the Cut",
       y = "Mean Carat",
       x = "Quality of the Cut",
       caption = "Data Source: diamonds | Analysis by Participant")+
  coord_flip()+
  theme_minimal()

p4

Patchwork - Combine your plots

  • After making all your plots, you can combine them using patchwork
library(patchwork)
  • It simply combines all your plots the way you want
p1 + p2 + p3 + p4

Patchwork - Combine your plots

  • You can combine your plots in many ways
p1 / p4

  • Annotations can also be added
my_patchwork <- (p1 + p2)/(p3 + p4)

my_patchwork + plot_annotation(
  title = 'How I formatted my plots',
  subtitle = 'I used patchwork to format my plots',
  caption = 'Author: Participant')
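
Beyond the `+` and `/` operators, patchwork also provides plot_layout() for controlling the grid and a `tag_levels` argument in plot_annotation() for labelling panels. The sketch below uses two small stand-in plots (`pa` and `pb` are names introduced here) so that it runs on its own.

```r
library(ggplot2)
library(patchwork)

# Two small stand-in plots
pa <- ggplot(data = diamonds) +
  geom_bar(aes(x = cut))
pb <- ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.1)

# Stack the plots in a single column and tag the panels A and B
stacked <- pa + pb +
  plot_layout(ncol = 1) +
  plot_annotation(tag_levels = "A")

stacked
```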